[2]:
import pandas as pd
import matplotlib.pyplot as plt
[3]:
iris = pd.read_csv(r'C:\Users\user\Desktop\iris (3).csv')
[4]:
print(iris.shape)
(150, 6)
[5]:
iris.head()
[5]:
|   | id | sepal_len | sepal_wd | petal_len | petal_wd | species |
|---|---|---|---|---|---|---|
| 0 | 0 | 5.1 | 3.5 | 1.4 | 0.2 | iris-setosa |
| 1 | 1 | 4.9 | 3.0 | 1.4 | 0.2 | iris-setosa |
| 2 | 2 | 4.7 | 3.2 | 1.3 | 0.2 | iris-setosa |
| 3 | 3 | 4.6 | 3.1 | 1.5 | 0.2 | iris-setosa |
| 4 | 4 | 5.0 | 3.6 | 1.4 | 0.2 | iris-setosa |
[6]:
iris.drop('id', axis = 1, inplace = True)
[7]:
iris.head()
[7]:
|   | sepal_len | sepal_wd | petal_len | petal_wd | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | iris-setosa |
[8]:
#summary statistics
iris.describe()
[8]:
|   | sepal_len | sepal_wd | petal_len | petal_wd |
|---|---|---|---|---|
| count | 150.000000 | 150.000000 | 150.000000 | 150.000000 |
| mean | 5.843333 | 3.057333 | 3.758000 | 1.199333 |
| std | 0.828066 | 0.435866 | 1.765298 | 0.762238 |
| min | 4.300000 | 2.000000 | 1.000000 | 0.100000 |
| 25% | 5.100000 | 2.800000 | 1.600000 | 0.300000 |
| 50% | 5.800000 | 3.000000 | 4.350000 | 1.300000 |
| 75% | 6.400000 | 3.300000 | 5.100000 | 1.800000 |
| max | 7.900000 | 4.400000 | 6.900000 | 2.500000 |
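The summary table above can be reproduced without the local CSV; as a sketch, scikit-learn bundles the same 150-row iris measurements (the renamed column names below are assumptions to match this notebook's schema):

```python
import pandas as pd
from sklearn.datasets import load_iris

# Stand-in for the local CSV: sklearn ships the same iris data.
iris = load_iris(as_frame=True).frame.rename(columns={
    'sepal length (cm)': 'sepal_len', 'sepal width (cm)': 'sepal_wd',
    'petal length (cm)': 'petal_len', 'petal width (cm)': 'petal_wd',
})
summary = iris[['sepal_len', 'sepal_wd', 'petal_len', 'petal_wd']].describe()
print(summary.round(3))
```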
[9]:
missing_values = iris.isnull().sum()
print(missing_values)
sepal_len    0
sepal_wd     0
petal_len    0
petal_wd     0
species      0
dtype: int64
[10]:
#check the data types of the dataset
print(iris.dtypes)
sepal_len    float64
sepal_wd     float64
petal_len    float64
petal_wd     float64
species       object
dtype: object
[11]:
#No missing values, and all the features in the dataset are numeric, so we conclude that the dataset is clean.
[12]:
class_distr = iris['species'].value_counts()
print(class_distr)
iris-setosa        50
iris-versicolor    50
iris-virginica     50
Name: species, dtype: int64
[13]:
import matplotlib.pyplot as plt
[14]:
#This gives us a much clearer idea of the distribution of the input variables, showing that both sepal length and sepal width have a normal (Gaussian) distribution.
[15]:
#check correlation between the features
[16]:
#From the above visualization it is difficult to separate "iris-versicolor" from "iris-virginica" because of the overlap of petal length & width and also sepal length & width.
[17]:
pd.plotting.scatter_matrix(iris)
[18]:
#correlation_matrix = iris[['sepal_wd', 'sepal_len', 'petal_wd', 'petal_len']].corr()
[19]:
#We identify that the length and the width are the most useful features to separate the species.
[32]:
#Data preparation
[40]:
#We split the data so that the train and test sets have approximately the same percentage of samples of each target class as the complete set.
[44]:
from sklearn.model_selection import train_test_split
X = iris.drop('species', axis = 1)
Y = iris['species']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.30, random_state = 1, stratify = Y)
[45]:
print(Y_train.value_counts())
print(Y_test.value_counts())
iris-setosa        35
iris-virginica     35
iris-versicolor    35
Name: species, dtype: int64
iris-virginica     15
iris-setosa        15
iris-versicolor    15
Name: species, dtype: int64
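The effect of stratify = Y described above can be demonstrated on the bundled iris data (a minimal sketch, not the notebook's own cells):

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)
# With stratify, each class keeps its share: 35 per class in train, 15 in test.
print(sorted(Counter(y_tr).values()), sorted(Counter(y_te).values()))
```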
[46]:
#MODELLING
[47]:
from sklearn.neighbors import KNeighborsClassifier
[48]:
#Now create an instance knn of the class KNeighborsClassifier
[53]:
knn = KNeighborsClassifier(n_neighbors = 5)
[54]:
#Note that the only parameter we need to set in this problem is n_neighbors, the k in kNN.
[55]:
#Use the data X_train and Y_train to train the model
[56]:
#FITTING
[59]:
print(knn.fit(X_train, Y_train))
KNeighborsClassifier()
[60]:
#We use mostly the default values for the parameters, e.g., metric = 'minkowski'
[61]:
#sklearn.neighbors.KNeighborsClassifier(n_neighbors = 5, weights = 'uniform', algorithm = 'auto', leaf_size = 30, p = 2, metric = 'minkowski', metric_params = None, n_jobs = None)
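The notebook fixes n_neighbors = 5; a common way to justify a value of k is cross-validation. A sketch on the bundled iris data (the candidate k values are arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# Mean 5-fold accuracy for a few candidate values of k.
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X, y, cv=5).mean()
          for k in (1, 3, 5, 7, 9)}
for k, s in scores.items():
    print(f"k={k}: {s:.3f}")
```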
[62]:
#LABEL PREDICTION
[63]:
#To make a prediction in scikit-learn, we can call the method predict(). We are trying to predict the species of iris.
[64]:
#Let's make the prediction on the test data set and save the output in pred for later review
[65]:
pred = knn.predict(X_test)
[66]:
#Let's review the first prediction
[71]:
print(pred[:1])
['iris-virginica']
[74]:
#PROBABILITY PREDICTION[75]:
#Of all classification algorithms implemented in scikit learn, there is an addittional method 'predict_prob'.[84]:
y_pred_prob = knn.predict_proba(X_test)[[0. 0. 1.] [1. 0. 0.] [1. 0. 0.] [0. 1. 0.] [0. 1. 0.]]
[82]:
print(pred[:5])['iris-virginica' 'iris-setosa' 'iris-setosa' 'iris-versicolor' 'iris-versicolor']
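The notebook stops at predictions; closing the loop with a test-set score is a natural next step. A sketch using the bundled iris data with the same split settings:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
acc = accuracy_score(y_te, knn.predict(X_te))  # fraction of correct labels
print(f"test accuracy: {acc:.3f}")
```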
[ ]:
#The 1st sample is predicted to be iris-virginica with probability 1.
[4]:
#Step 1: Import Libraries
[2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.seasonal import STL
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.holtwinters import ExponentialSmoothing
[5]:
#Step 2: Load and Visualize Data
#Load your time series data into a pandas DataFrame. Ensure that the time column is in the proper datetime format. Use matplotlib to visualize the data.
[11]:
# Load your time series data
data = pd.read_csv(r"C:\Users\user\Desktop\AirPassengers.csv")
# Convert the date column to a datetime object (if not already)
data['Month'] = pd.to_datetime(data['Month'])
# Set the date column as the DataFrame index
data.set_index('Month', inplace=True)
# Visualize the time series data
plt.figure(figsize=(10, 6))
plt.plot(data.index, data['Passengers'], label='Time Series Data')
plt.xlabel('Date')
plt.ylabel('Passengers')
plt.title('Time Series Data')
plt.legend()
plt.show()
[22]:
print(data)
            Passengers
Month
1949-01-01         112
1949-02-01         118
1949-03-01         132
1949-04-01         129
1949-05-01         121
...                ...
1960-08-01         606
1960-09-01         508
1960-10-01         461
1960-11-01         390
1960-12-01         432

[144 rows x 1 columns]
[23]:
#Step 3: Check for Stationarity
#Stationarity is a crucial assumption in many time series models. We can use the Augmented Dickey-Fuller test to check for stationarity.
[24]:
def check_stationarity(series):
    result = adfuller(series)
    print('ADF Statistic:', result[0])
    print('p-value:', result[1])
    print('Critical Values:')
    for key, value in result[4].items():
        print(f'   {key}: {value}')

check_stationarity(data['Passengers'])
ADF Statistic: 0.8153688792060497
p-value: 0.991880243437641
Critical Values:
   1%: -3.4816817173418295
   5%: -2.8840418343195267
   10%: -2.578770059171598
[25]:
#If the p-value is less than the significance level (e.g., 0.05), we reject the null hypothesis of a unit root, indicating that the data is stationary. Here p is approximately 0.99, so the series is non-stationary.
[26]:
#Step 4: Detrending
[27]:
#If your data has a clear trend, you can detrend it to make it stationary. One way to do this is by differencing the series.
[28]:
data['Detrended'] = data['Passengers'] - data['Passengers'].shift(1)
data.dropna(inplace=True)
plt.figure(figsize=(10, 6))
plt.plot(data.index, data['Detrended'], label='Detrended Data')
plt.xlabel('Date')
plt.ylabel('Detrended Value')
plt.title('Detrended Time Series Data')
plt.legend()
plt.show()
[29]:
#Step 5: Seasonality Analysis
[30]:
#To identify and remove seasonality, we can use seasonal-trend decomposition of the time series (STL).
[31]:
seasonal_decomp = STL(data['Passengers'], seasonal=13)
result = seasonal_decomp.fit()
plt.figure(figsize=(10, 8))
plt.subplot(4, 1, 1)
plt.plot(data.index, result.trend, label='Trend')
plt.xlabel('Date')
plt.ylabel('Trend')
plt.legend()
plt.subplot(4, 1, 2)
plt.plot(data.index, result.seasonal, label='Seasonal')
plt.xlabel('Date')
plt.ylabel('Seasonal')
plt.legend()
plt.subplot(4, 1, 3)
plt.plot(data.index, result.resid, label='Residuals')
plt.xlabel('Date')
plt.ylabel('Residuals')
plt.legend()
plt.subplot(4, 1, 4)
plt.plot(data.index, result.observed, label='Original')
plt.xlabel('Date')
plt.ylabel('Original')
plt.legend()
plt.tight_layout()
plt.show()
[32]:
#Step 6: Smoothing
[33]:
#Smoothing techniques like moving averages or exponential smoothing can help remove noise and emphasize patterns.
[36]:
data['Smoothed'] = data['Passengers'].rolling(window=7).mean()  # Moving average with window size 7
plt.figure(figsize=(10, 6))
plt.plot(data.index, data['Passengers'], label='Original Data')
plt.plot(data.index, data['Smoothed'], label='Smoothed Data')
plt.xlabel('Date')
plt.ylabel('Passengers')
plt.title('Original and Smoothed Time Series Data')
plt.legend()
plt.show()
[37]:
#Step 7: Autocorrelation and Partial Autocorrelation
[39]:
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plot_acf(data['Detrended'], ax=plt.gca(), lags=20)
plt.subplot(1, 2, 2)
plot_pacf(data['Detrended'], ax=plt.gca(), lags=20)
plt.tight_layout()
plt.show()
[40]:
#Step 8: Choose and Fit a Model
[41]:
#Based on the autocorrelation and partial autocorrelation plots, select the orders for the ARIMA model.
[43]:
# Assume ARIMA(1, 0, 1) as an example
model = ARIMA(data['Passengers'], order=(1, 0, 1))
result = model.fit()
# Print the model summary
print(result.summary())
C:\ProgramData\anaconda3\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used.
  self._init_dates(dates, freq)
SARIMAX Results
==============================================================================
Dep. Variable: Passengers No. Observations: 143
Model: ARIMA(1, 0, 1) Log Likelihood -696.484
Date: Sat, 22 Jul 2023 AIC 1400.967
Time: 23:18:07 BIC 1412.819
Sample: 02-01-1949 HQIC 1405.783
- 12-01-1960
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const 281.4780 57.109 4.929 0.000 169.547 393.410
ar.L1 0.9365 0.028 33.290 0.000 0.881 0.992
ma.L1 0.4261 0.076 5.595 0.000 0.277 0.575
sigma2 974.9441 114.826 8.491 0.000 749.890 1199.999
===================================================================================
Ljung-Box (L1) (Q): 0.05 Jarque-Bera (JB): 1.84
Prob(Q): 0.82 Prob(JB): 0.40
Heteroskedasticity (H): 6.69 Skew: 0.27
Prob(H) (two-sided): 0.00 Kurtosis: 3.16
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
[44]:
#Step 9: Model Validation
[45]:
#Split the data into training and testing sets, fit the model on the training data, and validate its performance on the testing data.
[48]:
train_size = int(len(data) * 0.8)
train, test = data.iloc[:train_size], data.iloc[train_size:]
model = ARIMA(train['Passengers'], order=(1, 0, 1))
result = model.fit()
# Forecast on the test set
forecast_values = result.forecast(steps=len(test))
# Calculate the Mean Absolute Error (MAE) for validation
mae = np.mean(np.abs(forecast_values - test['Passengers']))
print(f'Mean Absolute Error (MAE): {mae:.2f}')
Mean Absolute Error (MAE): 105.61
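An MAE of 105.61 is hard to judge in isolation; comparing it against a naive forecast (repeat the last training value) gives context. A sketch with hypothetical toy numbers (not the real passenger series):

```python
import numpy as np

def mae(forecast, actual):
    """Mean absolute error between a forecast and the observed values."""
    return float(np.mean(np.abs(np.asarray(forecast) - np.asarray(actual))))

# Hypothetical toy values standing in for train/test passenger counts.
train = np.array([112.0, 118.0, 132.0, 129.0, 121.0])
test = np.array([135.0, 148.0, 148.0, 136.0])
naive = np.full(len(test), train[-1])  # naive: repeat the last observed value
print("naive MAE:", mae(naive, test))  # a fitted model should beat this number
```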
[49]:
#Step 10: Forecasting
[50]:
#Once you have a well-fitted model, you can use it to forecast future values.
[59]:
# Re-fit the model on the entire dataset
model = ARIMA(data['Passengers'], order=(1, 0, 1))
result = model.fit()
# Forecast future values
forecast_steps = 12  # For example, forecast 12 steps into the future
forecast_values = result.forecast(steps=forecast_steps)
# Plot the original data and the forecasted values
plt.figure(figsize=(10, 6))
plt.plot(data.index, data['Passengers'], label='Original Data')
plt.plot(pd.date_range(data.index[-1], periods=forecast_steps + 1, inclusive='right'), forecast_values, label='Forecasted Passengers', color='orange')
plt.xlabel('Date')
plt.ylabel('Passengers')
plt.title('Original Data and Forecasted Passengers')
plt.legend()
plt.show()
C:\ProgramData\anaconda3\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used.
  self._init_dates(dates, freq)
[61]:
xxxxxxxxxxprint("Congratulations! You have completed an in-depth time series analysis. You've learned how to load and visualize time series data, check for stationarity, detrend the data, analyze seasonality, smooth the data, perform autocorrelation and partial autocorrelation analysis, choose and fit an ARIMA model, validate the model, and make future forecasts.Time series analysis is a powerful tool for understanding and forecasting data with temporal patterns. However, keep in mind that this tutorial only scratches the surface of the vast field of time series analysis. There are many more advanced techniques and models to explore, such as seasonal ARIMA (SARIMA), seasonal decomposition of time series with trend and seasonality (STL-ATS), state-space models, and machine learning approaches like LSTM (Long Short-Term Memory) networks.Remember that the effectiveness of time series analysis depends on the quality of your data, the appropriateness of the selected model, and the accuracy of the forecasting assumptions. Always validate your results, and consider the context and domain knowledge when interpreting the outcomes.As you continue to explore time series analysis, I encourage you to work on different datasets, experiment with various models and techniques, and stay updated with the latest advancements in the field. This will help you become more proficient in analyzing and forecasting time series data for various real-world applications. Happy analyzing!")Congratulations! You have completed an in-depth time series analysis. You've learned how to load and visualize time series data, check for stationarity, detrend the data, analyze seasonality, smooth the data, perform autocorrelation and partial autocorrelation analysis, choose and fit an ARIMA model, validate the model, and make future forecasts.Time series analysis is a powerful tool for understanding and forecasting data with temporal patterns. 
However, keep in mind that this tutorial only scratches the surface of the vast field of time series analysis. There are many more advanced techniques and models to explore, such as seasonal ARIMA (SARIMA), seasonal decomposition of time series with trend and seasonality (STL-ATS), state-space models, and machine learning approaches like LSTM (Long Short-Term Memory) networks.Remember that the effectiveness of time series analysis depends on the quality of your data, the appropriateness of the selected model, and the accuracy of the forecasting assumptions. Always validate your results, and consider the context and domain knowledge when interpreting the outcomes.As you continue to explore time series analysis, I encourage you to work on different datasets, experiment with various models and techniques, and stay updated with the latest advancements in the field. This will help you become more proficient in analyzing and forecasting time series data for various real-world applications. Happy analyzing!
[2]:
#Sure, let's use an example with imbalanced data. In this scenario, we'll use the "Credit Card Fraud Detection" dataset from Kaggle. This dataset contains transactions made by credit cards, and the goal is to detect fraudulent transactions, which are typically a very small proportion of the total transactions, making the data imbalanced.
[20]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load the credit card fraud dataset
data = pd.read_csv(r"C:\Users\user\Desktop\creditcard.csv")
# Explore the dataset
print(data['Class'].value_counts())
print(data['Class'].head())
#We have 284315 samples for class 0 and 492 for class 1, which shows that the data is imbalanced. Note: we choose the 'Class' column as our target variable.
0    284315
1       492
Name: Class, dtype: int64
0    0
1    0
2    0
3    0
4    0
Name: Class, dtype: int64
[6]:
#Step 2: Prepare the data for modeling.[21]:
xxxxxxxxxx# Separate features and target variableX = data.drop('Class', axis=1) #we drop claas column from the variables y = data['Class'] #we choose class our target variable.# Split the data into training and testing setsX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)# Standardize the featuresscaler = StandardScaler()X_train = scaler.fit_transform(X_train)X_test = scaler.transform(X_test)[10]:
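A detail worth noting in the cell above: the scaler is fit on the training split only and merely applied to the test split, which prevents test-set statistics from leaking into training. A tiny sketch of the same pattern with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])  # train mean 2.5, var 1.25
X_test = np.array([[10.0]])

scaler = StandardScaler().fit(X_train)  # statistics come from train only
z = scaler.transform(X_test)[0, 0]      # test is scaled with *train* stats
print(round(z, 3))
```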
[10]:
#Step 3: Choose a machine learning algorithm (Logistic Regression) and train the model.
[6]:
# Create and train a Logistic Regression classifier
logistic_model = LogisticRegression(random_state=42)
logistic_model.fit(X_train, y_train)
[6]:
LogisticRegression(random_state=42)
[12]:
#Step 4: Make predictions on the test data and evaluate the model.
[13]:
# Make predictions on the test data
y_pred = logistic_model.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
# Print the classification report for more detailed evaluation
print(classification_report(y_test, y_pred))
# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(conf_matrix)
Accuracy: 0.9991222218320986
precision recall f1-score support
0 1.00 1.00 1.00 56864
1 0.86 0.58 0.70 98
accuracy 1.00 56962
macro avg 0.93 0.79 0.85 56962
weighted avg 1.00 1.00 1.00 56962
Confusion Matrix:
[[56855 9]
[ 41 57]]
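The headline accuracy above is driven almost entirely by the majority class; recomputing the per-class metrics directly from the confusion matrix makes that explicit (the numbers are taken from the output above):

```python
# Confusion matrix from the run above: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = 56855, 9, 41, 57

accuracy = (tp + tn) / (tp + tn + fp + fn)  # dominated by the 56864 negatives
precision = tp / (tp + fp)  # of predicted frauds, how many were real
recall = tp / (tp + fn)     # of real frauds, how many were caught
print(f"accuracy={accuracy:.4f} precision={precision:.2f} recall={recall:.2f}")
```

Despite 99.9% accuracy, only 58% of the frauds are detected, which is why the next step addresses the imbalance.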
[14]:
xxxxxxxxxx#Step 5: Handle imbalanced data using techniques like "class weights" or "resampling."[16]:
xxxxxxxxxx# Option 1: Using class weightslogistic_model_weighted = LogisticRegression(class_weight='balanced', random_state=42)logistic_model_weighted.fit(X_train, y_train)# Option 2: Using resampling techniques like SMOTE (Synthetic Minority Over-sampling Technique)from imblearn.over_sampling import SMOTEsmote = SMOTE(random_state=42)X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)logistic_model_resampled = LogisticRegression(random_state=42)logistic_model_resampled.fit(X_train_resampled, y_train_resampled)[16]:
LogisticRegression(random_state=42)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LogisticRegression(random_state=42)
[19]:
xxxxxxxxxxprint("By handling the imbalanced data, you improve the model's performance in detecting the minority class (fraudulent transactions). Techniques like using class weights or resampling can help mitigate the impact of imbalanced data on the model \n's training and lead to more accurate predictions for the minority class. Keep in mind that imbalanced data is a common challenge in machine learning, and there are various other techniques and algorithms designed to address this issue, such as using different evaluation metrics (e.g., ROC-AUC, precision-recall curves) or employing ensemble methods like Random Forest and Gradient Boosting, which are often robust to imbalanced data.")By handling the imbalanced data, you improve the model's performance in detecting the minority class (fraudulent transactions). Techniques like using class weights or resampling can help mitigate the impact of imbalanced data on the model 's training and lead to more accurate predictions for the minority class. Keep in mind that imbalanced data is a common challenge in machine learning, and there are various other techniques and algorithms designed to address this issue, such as using different evaluation metrics (e.g., ROC-AUC, precision-recall curves) or employing ensemble methods like Random Forest and Gradient Boosting, which are often robust to imbalanced data.
[20]:
xxxxxxxxxx#Step 6: Evaluate the models with imbalanced data handling.[21]:
xxxxxxxxxx# Option 1: Using class weightsy_pred_weighted = logistic_model_weighted.predict(X_test)print("Model with Class Weights:")accuracy_weighted = accuracy_score(y_test, y_pred_weighted)print("Accuracy:", accuracy_weighted)print(classification_report(y_test, y_pred_weighted))conf_matrix_weighted = confusion_matrix(y_test, y_pred_weighted)print("Confusion Matrix:")print(conf_matrix_weighted)# Option 2: Using resampling with SMOTEy_pred_resampled = logistic_model_resampled.predict(X_test)print("Model with Resampling (SMOTE):")accuracy_resampled = accuracy_score(y_test, y_pred_resampled)print("Accuracy:", accuracy_resampled)print(classification_report(y_test, y_pred_resampled))conf_matrix_resampled = confusion_matrix(y_test, y_pred_resampled)print("Confusion Matrix:")print(conf_matrix_resampled)Model with Class Weights:
Accuracy: 0.9763702117200941
precision recall f1-score support
0 1.00 0.98 0.99 56864
1 0.06 0.92 0.12 98
accuracy 0.98 56962
macro avg 0.53 0.95 0.55 56962
weighted avg 1.00 0.98 0.99 56962
Confusion Matrix:
[[55526 1338]
[ 8 90]]
Model with Resampling (SMOTE):
Accuracy: 0.9745970998209332
precision recall f1-score support
0 1.00 0.97 0.99 56864
1 0.06 0.92 0.11 98
accuracy 0.97 56962
macro avg 0.53 0.95 0.55 56962
weighted avg 1.00 0.97 0.99 56962
Confusion Matrix:
[[55425 1439]
[ 8 90]]
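The closing remarks mention ROC-AUC and precision-recall curves; these threshold-free metrics are often more informative than accuracy here. A sketch on synthetic imbalanced data (make_classification stands in for the local credit-card CSV):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Roughly 1% positive class, loosely mimicking the fraud ratio.
X, y = make_classification(n_samples=5000, weights=[0.99], flip_y=0,
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=42, stratify=y)
model = LogisticRegression(class_weight='balanced', max_iter=1000)
model.fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]  # ranking scores, not hard labels
print("ROC-AUC:", round(roc_auc_score(y_te, proba), 3))
print("PR-AUC :", round(average_precision_score(y_te, proba), 3))
```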
[22]:
xxxxxxxxxx#By evaluating the models using different techniques to handle imbalanced data, you should notice the improvements in performance for detecting the minority class (fraudulent transactions). The class weights help the model give more importance to the minority class during training, and the SMOTE technique generates synthetic samples to balance the dataset, making the model better at identifying the minority class.[23]:
xxxxxxxxxx#Keep in mind that the choice between class weights and resampling techniques may vary based on the specific dataset and the characteristics of the problem. Experimenting with different approaches and evaluating their performance is essential in addressing the challenges posed by imbalanced data.[24]:
xxxxxxxxxx#In practice, you might also want to consider other techniques such as ensemble methods (e.g., RandomForest, Gradient Boosting) with imbalanced data handling to further improve the model's performance. Additionally, feature engineering, hyperparameter tuning, and feature selection are also crucial aspects of the machine learning process that can impact the model's effectiveness.[ ]:
xxxxxxxxxx Kernel status: Idle Executed 2 cellsElapsed time: 49 seconds
[27]:
import numpy as npimport pandas as pdfrom sklearn.datasets import load_winefrom pandas.plotting import scatter_matriximport matplotlib.pyplot as plt[14]:
data = load_wine()wine = pd.DataFrame(data.data, columns=data.feature_names)[15]:
print(wine.shape)(178, 13)
[16]:
print(wine.columns)
Index(['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium',
'total_phenols', 'flavanoids', 'nonflavanoid_phenols',
'proanthocyanins', 'color_intensity', 'hue',
'od280/od315_of_diluted_wines', 'proline'],
dtype='object')
[17]:
print(wine.iloc[:, :3].describe())
          alcohol  malic_acid         ash
count  178.000000  178.000000  178.000000
mean    13.000618    2.336348    2.366517
std      0.811827    1.117146    0.274344
min     11.030000    0.740000    1.360000
25%     12.362500    1.602500    2.210000
50%     13.050000    1.865000    2.360000
75%     13.677500    3.082500    2.557500
max     14.830000    5.800000    3.230000
[18]:
print(wine.describe())
           alcohol  malic_acid         ash  alcalinity_of_ash   magnesium  \
count 178.000000 178.000000 178.000000 178.000000 178.000000
mean 13.000618 2.336348 2.366517 19.494944 99.741573
std 0.811827 1.117146 0.274344 3.339564 14.282484
min 11.030000 0.740000 1.360000 10.600000 70.000000
25% 12.362500 1.602500 2.210000 17.200000 88.000000
50% 13.050000 1.865000 2.360000 19.500000 98.000000
75% 13.677500 3.082500 2.557500 21.500000 107.000000
max 14.830000 5.800000 3.230000 30.000000 162.000000
total_phenols flavanoids nonflavanoid_phenols proanthocyanins \
count 178.000000 178.000000 178.000000 178.000000
mean 2.295112 2.029270 0.361854 1.590899
std 0.625851 0.998859 0.124453 0.572359
min 0.980000 0.340000 0.130000 0.410000
25% 1.742500 1.205000 0.270000 1.250000
50% 2.355000 2.135000 0.340000 1.555000
75% 2.800000 2.875000 0.437500 1.950000
max 3.880000 5.080000 0.660000 3.580000
color_intensity hue od280/od315_of_diluted_wines proline
count 178.000000 178.000000 178.000000 178.000000
mean 5.058090 0.957449 2.611685 746.893258
std 2.318286 0.228572 0.709990 314.907474
min 1.280000 0.480000 1.270000 278.000000
25% 3.220000 0.782500 1.937500 500.500000
50% 4.690000 0.965000 2.780000 673.500000
75% 6.200000 1.120000 3.170000 985.000000
max 13.000000 1.710000 4.000000 1680.000000
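Mirroring the correlation check done for iris, the same idea applies to the wine features; a sketch that finds the strongest pairwise correlation:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine

data = load_wine()
wine = pd.DataFrame(data.data, columns=data.feature_names)
corr = wine.corr()
# Blank the diagonal, then find the strongest off-diagonal correlation.
off_diag = corr.where(~np.eye(len(corr), dtype=bool)).abs().stack()
a, b = off_diag.idxmax()
print(a, b, round(corr.loc[a, b], 3))
```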
[26]:
xxxxxxxxxxtarget_names = data.target_namestarget = data.targetcolors = [target_names[i] for i in target]scatter_matrix(wine, alpha=0.8, figsize=(12, 12), diagonal='hist', color ='c')#plt.savefig("plot.png")plt.legend(target_names, loc='upper right')plt.show()[22]:
scatter_matrix(wine.iloc[:,:5])plt.show()[ ]: